{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# EDA Case Study: Titanic"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 1. Task Description\n",
    "Titanic is a classic Kaggle competition. The task is to predict which passengers survived the Titanic shipwreck. For more details, see https://www.kaggle.com/c/titanic/overview/description."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 2. Goal of this notebook\n",
    "Because this is a famous competition, there are many excellent analyses of how to do EDA and how to build models for this task. See https://www.kaggle.com/startupsci/titanic-data-science-solutions for a reference. **In this notebook, we show how dataprep.eda can simplify the EDA process with a few lines of code.**"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 3. Load data"
   ],
   "metadata": {}
  },
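  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from dataprep.eda import plot, plot_correlation, plot_missing\n",
    "from dataprep.datasets import load_dataset\n",
    "\n",
    "# Load the Titanic training data that ships with dataprep\n",
    "train_df = load_dataset(\"titanic\")\n",
    "train_df"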
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 4. Glimpse of the data\n",
    "The first thing we need to do is get a rough understanding of the data: how many columns are available, which columns are categorical, which are numerical, and which contain missing values. With dataprep.eda, all of these questions can be answered with a single line of code!"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "plot(train_df)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "plot(df) shows the distribution of each column. For a categorical column, it shows a bar chart in blue. For a numeric column, it shows a histogram in gray. Currently, the column type (categorical or numeric) is taken from the column's type in the input dataframe. Hence, if a column's type is wrongly identified, you can change it in the dataframe. For example, calling *df[col] = df[col].astype(\"object\")* makes *col* a categorical column.\n",
    "\n",
    "From the output of plot(df), we know:\n",
    "1. **All Columns**: there is 1 label column, *Survived*, and 11 feature columns: *PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked*.\n",
    "2. **Categorical Columns**: *Survived, PassengerId, Pclass, Name, Sex, Ticket, Cabin, Embarked*.\n",
    "3. **Numeric Columns**: *Age, SibSp, Parch, Fare*.\n",
    "4. **Missing Values**: From the figure titles, we can see that 3 columns have missing values: *Age (19.9%), Cabin (77.1%), Embarked (0.2%)*.\n",
    "5. **Label Balance**: From the distribution of *Survived*, we can see that the positive and negative training examples are not very balanced: 38% of the records have the label *Survived = 1*."
   ],
   "metadata": {}
  },
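  {
   "cell_type": "markdown",
   "source": [
    "The type change described above can be sketched with plain pandas. The dataframe `toy_df` below is a hypothetical stand-in with made-up values, not the real Titanic data, and comparing the dtype before and after the cast is just one way to confirm that the column is now stored as `object`."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data standing in for the Titanic dataframe\n",
    "toy_df = pd.DataFrame({'Pclass': [1, 3, 2, 3], 'Fare': [71.3, 7.9, 13.0, 8.1]})\n",
    "\n",
    "# Pclass is stored as integers, so it starts out as a numeric column\n",
    "dtype_before = toy_df['Pclass'].dtype\n",
    "\n",
    "# Cast to object so that the column is handled as categorical\n",
    "toy_df['Pclass'] = toy_df['Pclass'].astype('object')\n",
    "dtype_after = toy_df['Pclass'].dtype\n",
    "\n",
    "print(dtype_before, '->', dtype_after)"
   ],
   "outputs": [],
   "metadata": {}
  },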
  {
   "cell_type": "markdown",
   "source": [
    "## 5. Identify useful features\n",
    "Now that we have a rough understanding of the data, we want to understand how each feature is correlated with the label column.\n",
    "### 5.1 Age, Cabin, and Embarked: features with missing values\n",
    "We first take a look at the features with missing values: Age, Cabin, and Embarked. To understand the missing values, we first call plot_missing(df) to see whether they have any underlying pattern."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "plot_missing(train_df)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "plot_missing(df) shows how missing values are distributed in the input data. From the output, we can see that the missing values are uniformly distributed among the records, with no underlying pattern. Next, we decide how to handle the missing values: should we remove the feature, remove the rows that contain missing values, or fill in the missing values? We first analyze whether these features are correlated with Survived. If they are, we may not want to remove them. We analyze the relationship between two columns by calling *plot(df, x, y)*."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "for feature in ['Age', 'Cabin', 'Embarked']:\n",
    "    plot(train_df, feature, 'Survived').show()"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the output, we can see that:\n",
    "1. The Age feature is correlated with Survived. Younger passengers were more likely to survive.\n",
    "2. The Embarked feature is correlated with Survived. Passengers with Embarked = C were more likely to survive.\n",
    "3. The correlation between Cabin and Survived is unclear: Cabin contains many missing values (77.1%) and many distinct values (147), so each distinct value covers only a few usable records.\n",
    "\n",
    "Hence, we can safely remove the Cabin column. For the Age and Embarked features, we should fill in the missing values.\n",
    "\n",
    "**Result: keep Age and Embarked and fill in their missing values; remove Cabin.**"
   ],
   "metadata": {}
  },
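  {
   "cell_type": "markdown",
   "source": [
    "The result above can be sketched with plain pandas. The notebook does not prescribe a filling strategy, so using the median for Age and the most frequent value for Embarked is an assumption made only for illustration. `toy_df` is a hypothetical stand-in with made-up values, not the real Titanic data."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data reusing the Titanic column names\n",
    "toy_df = pd.DataFrame({\n",
    "    'Age': [22.0, None, 38.0, 26.0],\n",
    "    'Embarked': ['S', 'C', None, 'S'],\n",
    "    'Cabin': [None, 'C85', None, None],\n",
    "})\n",
    "\n",
    "# Remove Cabin, which is mostly missing\n",
    "cleaned = toy_df.drop(columns=['Cabin'])\n",
    "\n",
    "# Assumption: fill Age with the median and Embarked with the most frequent value\n",
    "cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].median())\n",
    "cleaned['Embarked'] = cleaned['Embarked'].fillna(cleaned['Embarked'].mode()[0])\n",
    "\n",
    "print(cleaned)"
   ],
   "outputs": [],
   "metadata": {}
  },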
  {
   "cell_type": "markdown",
   "source": [
    "### 5.2 Pclass, Sex, Fare: features that look related to Survived\n",
    "\n",
    "We then focus on three columns that look related to Survived: Pclass, Sex, and Fare. It is reasonable to assume that upper-class passengers (Pclass = 1), women, and passengers who paid a high Fare were more likely to survive. To check this assumption, we call *plot(df, x, y)* to see how the three features are related to Survived."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "for feature in ['Pclass', 'Sex', 'Fare']:\n",
    "    plot(train_df, feature, 'Survived').show()"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the above output, we can see that our assumptions hold: upper-class passengers, women, and passengers who paid a high Fare were more likely to survive. We should include these three features in our model.\n",
    "\n",
    "**Result: keep Pclass, Sex and Fare**"
   ],
   "metadata": {}
  },
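  {
   "cell_type": "markdown",
   "source": [
    "As a numeric cross-check of what the plots suggest, the survival rate per group can be computed with a pandas `groupby`. This is only a sketch: `toy_df` is a hypothetical stand-in with made-up values, so the printed rates say nothing about the real Titanic data."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data reusing the Titanic column names\n",
    "toy_df = pd.DataFrame({\n",
    "    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],\n",
    "    'Survived': [1, 0, 1, 0, 1, 0],\n",
    "})\n",
    "\n",
    "# The mean of a 0/1 label within each group is that group's survival rate\n",
    "survival_rate = toy_df.groupby('Sex')['Survived'].mean()\n",
    "\n",
    "print(survival_rate)"
   ],
   "outputs": [],
   "metadata": {}
  },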
  {
   "cell_type": "markdown",
   "source": [
    "### 5.3 PassengerId, Name, SibSp, Parch, Ticket: remaining features\n",
    "\n",
    "We now analyze the remaining features.\n",
    "\n",
    "*PassengerId*: it is just an ID for each passenger, so we can drop it.\n",
    "\n",
    "*Name*: nearly every value is distinct, and it does not look correlated with the survival rate, so we may drop it.\n",
    "\n",
    "*Ticket*: it contains many duplicates and does not look correlated with the survival rate, so we may drop it.\n",
    "\n",
    "*SibSp*: not sure whether it is correlated or not.\n",
    "\n",
    "*Parch*: not sure whether it is correlated or not.\n",
    "\n",
    "Hence, we check whether *SibSp* and *Parch* are correlated with *Survived*."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "for feature in ['SibSp', 'Parch']:\n",
    "    plot(train_df, feature, 'Survived').show()"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the output, we find that the plots differ for different values of *Survived*. Hence the two features are correlated with it, and we will keep them as features.\n",
    "\n",
    "**Result: keep SibSp and Parch; remove PassengerId, Name and Ticket**"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 5.4 Result\n",
    "After this processing, we are left with the following features that are useful for predicting *Survived*: *Age, Embarked, Pclass, Sex, Fare, SibSp and Parch.*"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 6. Identify Correlated Features\n",
    "So far, we have identified the useful features one by one and removed the useless ones. Although each feature is useful for predicting *Survived*, when we consider them together, we may not want correlated features. Hence, we first identify correlated features. This can be done by simply calling *plot_correlation(df)*."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "plot_correlation(train_df)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "From the output, we know that:\n",
    "1. The most correlated columns are Parch and SibSp, with a Pearson correlation of 0.41.\n",
    "2. No two columns are highly correlated.\n",
    "\n",
    "Hence, we do not need to worry much about correlated features. However, as Parch and SibSp are slightly correlated both numerically and semantically, we may want to construct a new feature named Family, based on Parch and SibSp, which counts the total number of family members for each passenger.\n",
    "\n",
    "**Result: Construct a new feature Family based on Parch and SibSp.**"
   ],
   "metadata": {}
  }
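  ,
  {
   "cell_type": "markdown",
   "source": [
    "A minimal sketch of the Family feature, using plain pandas. The notebook does not give a formula, so defining Family as SibSp + Parch (relatives aboard, not counting the passenger) is an assumption. `toy_df` is a hypothetical stand-in with made-up values, and the Pearson coefficient is computed only to show how the pairwise correlation could be cross-checked."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data reusing the Titanic column names\n",
    "toy_df = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 1]})\n",
    "\n",
    "# Pearson correlation between the two columns (the default method of Series.corr)\n",
    "pearson = toy_df['SibSp'].corr(toy_df['Parch'])\n",
    "\n",
    "# Assumption: Family counts relatives aboard and excludes the passenger\n",
    "toy_df['Family'] = toy_df['SibSp'] + toy_df['Parch']\n",
    "\n",
    "print(pearson)\n",
    "print(toy_df)"
   ],
   "outputs": [],
   "metadata": {}
  }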
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.7.3 64-bit ('base': conda)"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "interpreter": {
   "hash": "52d78913ba91f8307eeebcfef9b522b98b0ac32d8b0330f277669c4c3a5dd4b4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}